Pendahuluan Data Mining
https://taudata.blogspot.com/p/applied-data-mining-adm.html
Supervised Learning - Classification 02
https://taudata.blogspot.com/2022/04/slcm-03.html
(C) Taufik Sutanto
Notes and Disclaimer
- This notebook is part of the free (open knowledge) eLearning course at: https://tau-data.id
- Some images are taken from other resources; we respect their ownership and provide a reference/citation to where each image originated. Nevertheless, we sometimes have trouble finding the origin of an image. If you own an image and would like it taken out of this open-knowledge course (or the citation revised), please contact us with the details here: https://tau-data.id/contact/
- Unless stated otherwise, tau-data generally permits its resources to be copied and/or modified for non-commercial purposes, provided proper acknowledgement/citation is given.
Outline:
- Review of the Previous Session
- Support Vector Machines
- Neural Network
- Ensemble Models
- Imbalanced Data Problem
# Importing modules for this notebook
import warnings; warnings.simplefilter('ignore')
import numpy as np, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import svm, preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_blobs, make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter
sns.set(style="ticks", color_codes=True)
"Done"
'Done'
k-Nearest Neighbour
Logistic Regression
Decision Tree Theory: Information Theory
Naive Bayes Classifier
- P(x) is constant, so it can be ignored.
- Its strongest assumption is independence between the predictor variables (hence the name "Naive").
- Classification is performed by computing the probability of each category given the data x = (x1, x2, ..., xm).
- For large data, an out-of-core approach (partial fit) can be used: http://scikit-learn.org/stable/modules/scaling_strategies.html#scaling-strategies
- NBC variants differ in how P(c|x) is computed, e.g. with a Gaussian (Normal) distribution, often called Gaussian Naive Bayes (GNB).
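The out-of-core idea above can be sketched with scikit-learn's `GaussianNB.partial_fit`; the chunked stream below is synthetic and purely illustrative:

```python
# Illustrative sketch: out-of-core (chunked) Gaussian Naive Bayes.
# The data stream is simulated; all class labels must be declared on
# the first partial_fit call.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
gnb = GaussianNB()
classes = np.array([0, 1])
for chunk in range(5):                 # pretend the data arrives in 5 chunks
    Xc = rng.randn(200, 3)
    yc = rng.randint(0, 2, 200)
    Xc[yc == 1] += 2.0                 # shift class 1 so the task is learnable
    gnb.partial_fit(Xc, yc, classes=classes if chunk == 0 else None)
print(gnb.score(Xc, yc))               # accuracy on the last chunk
```

Each chunk updates the per-class means and variances, so the full data set never has to fit in memory.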
Support Vector Machine (SVM)
Suppose the data are given as $\{(\bar{x}_1,y_1),...,(\bar{x}_n,y_n)\}$, where $\bar{x}_i$ is the input pattern of the $i^{th}$ observation and $y_i$ is the desired target value. The categories (classes) are represented by $y_i=\{-1,1\}$. A hyperplane separating these two classes (when they are "linearly separable") is: $$ \bar{w}'\bar{x}+b=0 $$ where $\bar{x}$ is the input vector (the predictors), $\bar{w}$ the weight vector, and $b$ the bias.
SVM Modeling (Hard Margin):
Support Vector Machine: Soft Margin
Larger C ==> less tolerance for outliers, and vice versa
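This trade-off can be made visible by counting support vectors at different values of C on synthetic, overlapping blobs (parameters are illustrative):

```python
# Sketch: smaller C gives a softer, wider margin, so more points end up
# inside it and become support vectors; larger C shrinks that set.
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
n_sv = {C: int(svm.SVC(kernel='linear', C=C).fit(X, y).n_support_.sum())
        for C in (0.01, 1, 100)}
print(n_sv)   # the support-vector count drops as C grows
```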
Dual and Quadratic Programming Solver
- The optimization above is usually solved by deriving its dual form.
- The optimal parameters are then approximated using a Quadratic Programming solver.
- Note that the objective function is convex ==> it has a global minimum.
- The optimal solution depends only on the data points on the margin (the support vectors, SVs), which can make the model more efficient (once the SVs are known).
- The SVs can also be used to analyze the "error bound": http://www.svms.org/vc-dimension/
Interpretation
- The Recursive Feature Elimination (RFE) method: https://link.springer.com/content/pdf/10.1023/A:1012487302797.pdf
- Inspect the squared value of each component of w (higher means more important).
- Caution: some online discussions claim that the sign (+/-) indicates each variable's importance, but this is not always true and can be disproven with a simple counterexample.
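The RFE procedure referenced above is available in scikit-learn; a minimal sketch on the iris data (an assumed example), wrapping a linear SVM:

```python
# Sketch: RFE repeatedly drops the feature whose linear-SVM weight has the
# smallest squared value, until the requested number of features remains.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
selector = RFE(SVC(kernel='linear'), n_features_to_select=2).fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier
```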
What about categorical data?
- Same as in (logistic) regression ==> dummy (indicator) variables.
- E.g. X1 = {a,b,c} ==> X1_a = [1,0,0], X1_b = [0,1,0], X1_c = [0,0,1]
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
# Example
df = pd.DataFrame({'X1': ['a', 'b', 'a','c','a'],'X2': [1, 2, 3, 2, 1]})
df.head()
|  | X1 | X2 |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | a | 3 |
| 3 | c | 2 |
| 4 | a | 1 |
df = pd.get_dummies(df)
df.head()
|  | X2 | X1_a | X1_b | X1_c |
|---|---|---|---|---|
| 0 | 1 | True | False | False |
| 1 | 2 | False | True | False |
| 2 | 3 | True | False | False |
| 3 | 2 | False | False | True |
| 4 | 1 | True | False | False |
Data Normalization/Standardization
- As in (logistic) regression, the predictors/features in an SVM model need to be standardized/normalized.
- http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
- Caution: standardize the data only after outliers have been handled properly.
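To see why the outlier caveat matters, compare `StandardScaler` with the median/IQR-based `RobustScaler` on a toy column containing one extreme value (illustrative numbers):

```python
# Sketch: one outlier inflates the standard deviation, so StandardScaler
# squashes the normal points together; RobustScaler (median/IQR) does not.
import numpy as np
from sklearn import preprocessing

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # last value is an outlier
std = preprocessing.StandardScaler().fit_transform(x)
rob = preprocessing.RobustScaler().fit_transform(x)
print(std[:4].ravel())   # the four normal points all land near -0.5
print(rob[:4].ravel())   # they keep their spread: [-1. -0.5 0. 0.5]
```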
scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
df['X2'] = scaler.fit_transform(df[['X2']])
df
|  | X2 | X1_a | X1_b | X1_c |
|---|---|---|---|---|
| 0 | -1.069045 | True | False | False |
| 1 | 0.267261 | False | True | False |
| 2 | 1.603567 | True | False | False |
| 3 | 0.267261 | False | False | True |
| 4 | -1.069045 | True | False | False |
# Example: plotting the optimal hyperplane
# http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py
X, y = make_blobs(n_samples=20, centers=2, random_state=6) # we create 20 separable points
clf = svm.SVC(kernel='linear', C=1000) # fit the model, don't regularize for illustration purposes
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
ax = plt.gca();xlim = ax.get_xlim(); ylim = ax.get_ylim()
# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30);yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,linestyles=['--', '-', '--'])# plot decision boundary and margins
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,linewidth=1, facecolors='none', edgecolors='k')# plot support vectors
plt.show()
SVM Kernel (trick)
Kernel Function Definition
- If for all $\bar{x},\bar{z} \in X$ we have $\kappa(\bar{x},\bar{z}) = <\phi(\bar{x}),\phi(\bar{z})>$,
then $\kappa$ is called a kernel function (and $\phi$ a feature map).
- Note that the kernel maps to a scalar (an inner product).
- This function is used in SVM (and in any other DM/ML model that can be expressed through inner products).
- Look back at the SVM formulation: most of it is expressed through inner products (i.e. w.x).
- See here for more details: https://nlp.stanford.edu/IR-book/html/htmledition/nonlinear-svms-1.html
Example: the Lagrangian (Wolfe) dual of the optimization above
Example 1
- Let $X\subseteq \Re^2$ and $\phi : \bar{x}=(x_1,x_2)\rightarrow \phi (\bar{x})=(x_1^2,x_2^2,\sqrt{2}x_1x_2)\in F=\Re^3$.
- Then
$<\phi(\bar{x}),\phi(\bar{z})>$
$=<(x_1^2,x_2^2,\sqrt{2}x_1x_2),(z_1^2,z_2^2,\sqrt{2}z_1z_2)>$
$=x_1^2z_1^2+x_2^2z_2^2+2x_1x_2z_1z_2$
$=(x_1z_1+x_2z_2)^2=<\bar{x},\bar{z}>^2$
- So $\kappa(\bar{x},\bar{z})=<\bar{x},\bar{z}>^2$ is a kernel function and $F$ is its feature space.
Example 2
- Let x = (x1, x2, x3) and y = (y1, y2, y3),
- with the feature map f(x) = (x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²);
- the kernel is then K(x, y) = <f(x), f(y)> = <x, y>².
- Numeric example: let x = (1, 2, 3) and y = (4, 5, 6). Then:
- f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
- f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
- <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
- Complicated! Using the kernel function, the computation simplifies to:
- K(x, y) = (4 + 10 + 18)² = 32² = 1024
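The arithmetic in Example 2 can be verified in a few lines; `f` below is the example's explicit feature map (all pairwise products):

```python
# Numeric check of Example 2: explicit feature map versus kernel shortcut.
import numpy as np

def f(v):
    # all pairwise products v_i * v_j, flattened: the example's feature map
    return np.outer(v, v).ravel()

x, y = np.array([1, 2, 3]), np.array([4, 5, 6])
lhs = f(x) @ f(y)      # inner product in the 9-dimensional feature space
rhs = (x @ y) ** 2     # kernel trick: <x, y>**2, computed in 3 dimensions
print(lhs, rhs)        # 1024 1024
```

The two values agree, but the kernel form never builds the 9-dimensional vectors.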
Well-Known Kernel Functions
SVM Binary to MultiClass
Pros
- Good accuracy
- Works well on relatively small data samples
- Depends only on the SVs ==> improved efficiency
- Convex ==> global minimum ==> guaranteed convergence
Cons
- Inefficient for large data
- Accuracy can be low for multiclass problems (it is hard to capture relations between categories in the model)
- Not robust to noise
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns
# load the iris data
df = sns.load_dataset("iris")
g = sns.pairplot(df, hue="species")
df.sample(7)
|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 66 | 5.6 | 3.0 | 4.5 | 1.5 | versicolor |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 83 | 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
df.describe(include= 'all')
|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150 |
| unique | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | setosa |
| freq | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 | NaN |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | NaN |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | NaN |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 | NaN |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 | NaN |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 | NaN |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 | NaN |
# Separate the data
df2 = df[df['species'].isin(['versicolor', 'setosa'])] # take only 2 classes for a binary example
X = df2[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df2['species']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
print(X_train.shape, X_test.shape)
(70, 4) (30, 4)
Y
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
95 versicolor
96 versicolor
97 versicolor
98 versicolor
99 versicolor
Name: species, Length: 100, dtype: object
# Fitting and evaluate the model
dSVM = svm.SVC(C = 10**5, kernel = 'linear')
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
print(confusion_matrix(Y_test, y_SVM))
print(classification_report(Y_test, y_SVM))
Accuracy =  1.0
[[17 0]
[ 0 13]]
precision recall f1-score support
setosa 1.00 1.00 1.00 17
versicolor 1.00 1.00 1.00 13
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
# The support vectors
print('Indices of the SVs: ', dSVM.support_)
print('Their data vectors: \n', dSVM.support_vectors_)
Indices of the SVs:  [ 6 41 39]
Their data vectors: 
 [[4.5 2.3 1.3 0.3]
 [5.1 3.3 1.7 0.5]
 [5.1 2.5 3.  1.1]]
# Model Weights for interpretations
print('w = ',dSVM.coef_)
print('b = ',dSVM.intercept_)
w =  [[ 0.04621298 -0.52129234  1.00306886  0.46413981]]
b =  [-1.45238036]
# Using kernels: http://scikit-learn.org/stable/modules/svm.html#svm-kernels
for kernel in ('sigmoid', 'poly', 'rbf'):
dSVM = svm.SVC(kernel=kernel)
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print(accuracy_score(Y_test, y_SVM))
0.43333333333333335
1.0
1.0
# Example: multiclass SVM (with and without kernels)
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species'] # use all species (3 categories)
X = preprocessing.StandardScaler().fit_transform(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
print(X_train.shape, X_test.shape)
df.describe(include='all')
(105, 4) (45, 4)
|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150 |
| unique | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | setosa |
| freq | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 | NaN |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | NaN |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | NaN |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 | NaN |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 | NaN |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 | NaN |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 | NaN |
set(Y_train)
{'setosa', 'versicolor', 'virginica'}
# One Versus All: http://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf
dSVM = svm.LinearSVC()
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
y_SVM
Accuracy =  0.9555555555555556
array(['versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',
'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor',
'setosa', 'virginica', 'versicolor', 'versicolor', 'setosa',
'virginica', 'versicolor', 'virginica', 'setosa', 'virginica',
'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
'virginica', 'versicolor', 'versicolor', 'setosa', 'virginica',
'setosa', 'setosa', 'versicolor', 'versicolor', 'setosa', 'setosa',
'versicolor', 'setosa', 'versicolor', 'setosa', 'setosa',
'versicolor', 'setosa', 'virginica', 'versicolor'], dtype=object)
# There are 3 classifiers (as expected)
dSVM.coef_
array([[-0.22672574, 0.41106345, -0.6761091 , -0.63330724],
[ 0.01212648, -0.43968806, 0.74023292, -0.75194366],
[-0.09711928, -0.43961137, 1.61658978, 1.39170709]])
# All At Once Method http://www.jmlr.org/papers/volume2/crammer01a/crammer01a.pdf
dSVM = svm.SVC(decision_function_shape='ovo')
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
y_SVM
Accuracy =  0.9333333333333333
array(['versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',
'versicolor', 'setosa', 'setosa', 'versicolor', 'versicolor',
'setosa', 'virginica', 'versicolor', 'virginica', 'setosa',
'virginica', 'versicolor', 'virginica', 'setosa', 'virginica',
'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
'virginica', 'versicolor', 'versicolor', 'setosa', 'virginica',
'setosa', 'setosa', 'versicolor', 'virginica', 'setosa', 'setosa',
'versicolor', 'setosa', 'versicolor', 'setosa', 'setosa',
'versicolor', 'setosa', 'virginica', 'versicolor'], dtype=object)
Artificial Neural Network (ANN)
Toy Data Example Neural Network (Back Propagation)
Multiclass ANN
Given the mathematical formulation and the way a Neural Network operates, do we also need to standardize the data, as with SVM and Logistic Regression?
Neural Network: Empirical Analysis of ANN Parameters
https://goo.gl/3rcnc9
Why can linear functions form a curved decision boundary?
http://s.id/j6i
Neural Network VS Deep Learning
# Neural Network: http://scikit-learn.org/stable/modules/neural_networks_supervised.html
NN = MLPClassifier(hidden_layer_sizes=(100,)) # 1 hidden layer with 100 neurons
NN.fit(X_train, Y_train)
y_NN = NN.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_NN))
Accuracy =  0.9333333333333333
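One empirical way to answer the standardization question above: train the same MLP on raw and on standardized features (iris here as an assumed example; seeds and iteration count are illustrative):

```python
# Sketch: the same MLP on raw versus standardized inputs. With inputs on
# very different scales the gap is usually larger; iris features are mild.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
acc_raw = MLPClassifier(max_iter=2000, random_state=0).fit(Xtr, ytr).score(Xte, yte)
sc = StandardScaler().fit(Xtr)   # fit the scaler on the training fold only
acc_std = MLPClassifier(max_iter=2000, random_state=0).fit(
    sc.transform(Xtr), ytr).score(sc.transform(Xte), yte)
print(acc_raw, acc_std)
```

Note that the scaler is fitted on the training split only, to avoid leaking test statistics into the model.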
Inductive bias:
- Bias in parameter estimation (statistics)
- Inductive bias of the sample (Machine Learning, Tom Mitchell)
- Inductive bias in classifier selection (Statistical Learning Theory, Vapnik)
h, i = .02, 1 # mesh step size; i is the running subplot index
names = ["Nearest Neighbors", "Logistic Regression", "Naive Bayes", "Linear SVM", "RBF SVM",
"Decision Tree", "Random Forest", "Neural Net"]
classifiers = [KNeighborsClassifier(3),
LogisticRegression(solver='lbfgs',multi_class='multinomial'),
GaussianNB(), SVC(kernel="linear", C=0.025), SVC(gamma=2, C=1),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1)]
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)
datasets = [make_moons(noise=0.3, random_state=0),make_circles(noise=0.2, factor=0.5, random_state=1),linearly_separable]
figure = plt.figure(figsize=(27, 9))
for ds_cnt, ds in enumerate(datasets):
# preprocess dataset, split into training and test part
X, y = ds
X = preprocessing.StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
# just plot the dataset first
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
if ds_cnt == 0:
ax.set_title("Input data")
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
edgecolors='k')
# Plot the testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
edgecolors='k')
ax.set_xlim(xx.min(), xx.max()); ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(()); ax.set_yticks(())
i += 1
# iterate over classifiers
for name, clf in zip(names, classifiers):
ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,edgecolors='k')
# Plot the testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,edgecolors='k', alpha=0.6)
ax.set_xlim(xx.min(), xx.max());ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(()); ax.set_yticks(())
if ds_cnt == 0:
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
size=15, horizontalalignment='right')
i += 1
plt.tight_layout();plt.show()
Ensemble Model
- What? Learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
- Why? Better predictions and a more stable model.
- How? Bagging & Boosting.
“Meta-algorithms”: Bagging & Boosting
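Besides the voting example below, scikit-learn has a direct bagging wrapper; a minimal sketch with bootstrap-resampled decision trees on synthetic data (parameters are illustrative):

```python
# Sketch: BaggingClassifier trains each tree on a bootstrap resample and
# aggregates their votes, which typically stabilizes a single deep tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=50, random_state=0).fit(Xtr, ytr)
print(tree.score(Xte, yte), bag.score(Xte, yte))
```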
# Voting (bagging) example in Python
# Note: Random Forest is also a bagging ensemble (albeit modified)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'
try:
    # Local Jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names).values # convert to a numpy array
except:
    # It's Google Colab: create the data folder, then download the file from GitHub
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/data/diabetes_data.csv
    data = pd.read_csv(file, names=names).values # convert to a numpy array
X, Y = data[:,0:8], data[:,8] # Slice
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
kNN = KNeighborsClassifier(3)
kNN.fit(X_train, Y_train)
Y_kNN = kNN.score(X_test, Y_test)
DT = DecisionTreeClassifier(random_state=1)
DT.fit(X_train, Y_train)
Y_DT = DT.score(X_test, Y_test)
model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(X_train,Y_train)
Y_Vot = model.score(X_test,Y_test)
print('Accuracy k-NN', Y_kNN)
print('Accuracy Decision Tree', Y_DT)
print('Accuracy Voting', Y_Vot)
Accuracy k-NN 0.696969696969697
Accuracy Decision Tree 0.70995670995671
Accuracy Voting 0.7142857142857143
# Averaging can also be used for classification (not only regression),
# but using the predicted probability of each category
T = DecisionTreeClassifier()
K = KNeighborsClassifier()
R= LogisticRegression()
T.fit(X_train,Y_train)
K.fit(X_train,Y_train)
R.fit(X_train,Y_train)
y_T=T.predict_proba(X_test)
y_K=K.predict_proba(X_test)
y_R=R.predict_proba(X_test)
Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Averaging accuracy', accuracy_score(Y_test, prediction))
[[0.27676107 0.72323893]
 [0.06742822 0.93257178]
 [0.29979175 0.70020825]
 [0.65073792 0.34926208]
 [0.98164276 0.01835724]]
[1, 1, 1, 0, 0]
Averaging accuracy 0.7402597402597403
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10, random_state=9, shuffle=True)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=1)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.746120984278879
Imbalanced Data
- The metric trap
- The accuracy of a particular category can matter more
- Example cases
- Undersampling
- Oversampling
- Model-based (weight adjustment)
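A random-oversampling sketch using only scikit-learn's `resample` (the dedicated `imbalanced-learn` package offers this and more; the data here is synthetic):

```python
# Sketch: random oversampling - resample the minority class with
# replacement until both classes have the same count.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(110, 3)
y = np.array([0] * 100 + [1] * 10)          # a 10:1 class imbalance
X_maj, X_min = X[y == 0], X[y == 1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
print(np.bincount(y_bal))                   # [100 100]
```

Undersampling is the mirror image: resample the majority class with `replace=False` down to the minority count.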
data = pd.read_csv(file, names=names)
data.head()
|  | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
plot = data["class"].value_counts().plot(kind='pie')
# Example of model-based imbalance treatment - SVM
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],centers=centers,cluster_std=clusters_std,random_state=0, shuffle=False)
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10}) #WEIGHTED SVM
wclf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')# plot the samples
ax = plt.gca()# plot the decision functions for both classifiers
xlim = ax.get_xlim(); ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)# create grid to evaluate model
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-']) # plot decision boundary and margins
Z = wclf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane for weighted classes
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])# plot decision boundary and margins for weighted classes
plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"], loc="upper right")
plt.show()
Weighted Decision Tree
data = pd.read_csv(file, names=names).values # convert to a numpy array
X, Y = data[:,0:8], data[:,8] # Slice
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3)
del T
T = DecisionTreeClassifier(random_state = 0)
T.fit(X_train,Y_train)
y_DT = T.predict(X_test)
print('Accuracy (plain Decision Tree) = ', accuracy_score(Y_test, y_DT))
print(classification_report(Y_test, y_DT))
del T
T = DecisionTreeClassifier(class_weight = 'balanced', random_state = 0)
T.fit(X_train,Y_train)
y_DT = T.predict(X_test)
print('Accuracy (weighted Decision Tree) = ', accuracy_score(Y_test, y_DT))
print(classification_report(Y_test, y_DT))
Accuracy (plain Decision Tree) =  0.6883116883116883
precision recall f1-score support
0.0 0.75 0.75 0.75 146
1.0 0.58 0.58 0.58 85
accuracy 0.69 231
macro avg 0.66 0.66 0.66 231
weighted avg 0.69 0.69 0.69 231
Accuracy (weighted Decision Tree) =  0.7142857142857143
precision recall f1-score support
0.0 0.77 0.77 0.77 146
1.0 0.61 0.61 0.61 85
accuracy 0.71 231
macro avg 0.69 0.69 0.69 231
weighted avg 0.71 0.71 0.71 231